ABSTRACT

We discuss whether modern machine learning methods can be used to characterize 
the physical nature of the large number of objects sampled by the modern multi-band 
digital surveys. In particular, we applied the MLPQNA (Multi Layer Perceptron with 
Quasi Newton Algorithm) method to the optical data of the Sloan Digital Sky Survey 
- Data Release 10, investigating whether photometric data alone suffice to disentangle 
different classes of objects as they are defined in the SDSS spectroscopic classification. 
We discuss three groups of classification problems: (i) the simultaneous classification
of galaxies, quasars and stars; (ii) the separation of stars from quasars; (iii) the
separation of galaxies with normal spectral energy distribution from those with peculiar
spectra, such as starburst or starforming galaxies and AGN. While confirming the
difficulty of disentangling AGN from normal galaxies on a photometric basis only, 
MLPQNA proved to be quite effective in the three-class separation. In disentangling 
quasars from stars and galaxies, our method achieved an overall efficiency of 91.31% 
and a QSO class purity of ~ 95%. The resulting catalogue of candidate quasars/AGNs 
consists of ~ 3.6 million objects, of which about half a million are also flagged as robust 
candidates, and will be made available on CDS VizieR facility. 

Key words: methods: data analysis - techniques: photometric - catalogues -
galaxies: active - quasars: general


1 INTRODUCTION 

Broad band photometry from wide-field imagers mounted 
on dedicated telescopes and instruments has been and will 
continue to be our main source of information for a large 
fraction of the extragalactic universe. Spectroscopy provides 
a more detailed and deeper understanding of the physical 
properties of individual objects than photometry. However, 
spectroscopy will never be able to fully sample the populations
of galactic or extragalactic objects in either number or
depth. It is therefore of great interest to determine whether 
it is possible to extract estimates of physical parameters, 
such as distance, metallicity, star formation rate, morphology
and the presence or absence of an active nucleus, from
the coarse information provided by photometry. 

For a given object, photometric quantities, such as magnitudes
in different bands, moments of the light distribution
and morphological indexes, define its position in a high-dimensional
parameter space which we shall call the Observed
Parameter Space (or OPS, cf. Djorgovski et al. 2013). The
de-projection of the OPS into the Physical Parameter Space
(PPS, i.e. the parameter space defined by the physical quantities)
is however a complex operation, made in some cases
almost impossible by the degeneracy existing in both the
data and the physical parameters themselves. Most of the
time, such de-projections require an intermediate passage,
namely the de-projection of the OPS onto the Spectroscopic
Parameter Space (or SPS), i.e. the space defined by observable
spectroscopic quantities such as redshift, equivalent widths
of specific spectral lines and spectroscopic indexes. The parameters
defining the SPS are in fact more directly related to
the intrinsic physical properties of the objects. Spectroscopy,
for instance, is more effective than photometry in disentangling
normal galaxies from those hosting an AGN, or in classifying
different types of AGN in broad classes (e.g., Seyfert,
LINERS, etc.). Actually, the definition of nearly all AGN
classes is based on spectroscopic criteria through the equivalent
widths of some lines (see for instance Kewley et al.
2001; Lamareille 2010).

The problem of mapping the OPS onto some subspaces
of the SPS and then onto the PPS is frequently encountered in
the literature. Examples include the determination of photometric
redshifts, which is an attempt to reproduce what is
decidedly a spectroscopic quantity (the redshift) using photometry
alone; the search for candidate quasars and blazars
using photometric diagnostics; or the search for candidate
AGN and starburst galaxies.

Usually these mapping functions are quite complex and
difficult (if not plainly impossible) to derive in simple
analytical forms and require statistical or empirical approaches
which are common in the field of Machine Learning
(hereinafter ML). In fact, if higher accuracy information
is available for a subset of objects (in the case discussed
here, spectroscopic parameters or flags in the SPS), it can
be used to teach a ML method how to map the available
data (the photometric data in the OPS) into the higher accuracy
ones. In the ML field this approach is usually called
supervised learning. We emphasize that, while supervised
learning is very powerful in uncovering the hidden relationship
between input parameters and the so-called knowledge
base (hereafter KB), there are two main limitations:

• the mapping function cannot be extrapolated outside 
of the region of the OPS that is properly sampled by the 
SPS; 

• any bias present in the KB is necessarily reproduced (if 
not amplified) in the output. 

To better exemplify these two points let us focus on the
main topic addressed in this work, where we use a specific ML
method, the MLPQNA (Brescia et al. 2013; Cavuoti et al.
2014a), to tackle a particular incarnation of the so-called
physical classification of galaxies problem. We will specifically
address the possibility of disentangling normal, non-active
galaxies from those hosting an AGN using only photometric
measurements.

In a previous paper we have discussed the use of
the same method to classify different types of AGN
(Cavuoti et al. 2014a). In this work we shall instead focus
on the question of whether it is possible to use only optical
colors to produce reliable (or at least statistically well controlled)
AGN candidates and to disentangle normal, inactive
galaxies from those hosting an AGN.

As already pointed out in Cavuoti et al. (2014a), in
spite of the unique physical mechanism responsible for the
nuclear activity (Antonucci 1993; Urry & Padovani 1995),
the phenomenological complexity of AGNs is so high that
there cannot be a unique method equally effective in identifying
all AGN phenomenologies in every redshift range
(cf. Messias et al. 2010). Moreover, AGN and starbursts,
long studied separately, are now thought to be correlated
(Schweitzer et al. 2006; Pilbratt et al. 2010) and difficult to
disentangle without using mid-IR data.

Furthermore, we also investigate the possibility of identifying
a catalogue of candidate QSOs, by comparing the
efficiency of classification in terms of a compromise between
purity and completeness following two different strategies: (i)
a self-consistent strategy starting from three
different classes of objects (galaxies, stars and
QSOs); (ii) a strategy that assumes a pre-determined separation
between resolved and unresolved objects (star/galaxy) and then
performs the usual two-class classification between stars
and QSOs.

The paper is structured as follows. In Section 2 we discuss
in some detail the data. In Section 3 we summarize the
general methodology adopted and provide a short description
of the ML method used. In Section 4 we describe in some
detail the various classification experiments performed and,
finally, in Section 5 we discuss the results.

2 THE DATA 

Among the countless merits of the Sloan Digital Sky Survey
(York et al. 2000; Stoughton et al. 2002), there is also
the fact that it has paved the way to the experimentation with,
and wide adoption within the astronomical community of,
innovative methods based on ML (or on statistical pattern
recognition), thus fostering the birth of the emerging field
of Astroinformatics (Borne 2010).

The SDSS in fact provides an ideal test ground for ML
algorithms and methods: a large and complex photometric
database with hundreds of features measured for hundreds
of millions of objects (defining the SDSS subregion of the
OPS), complemented by spectroscopic information for a significant
subsample (roughly 1%) of objects (i.e. the SPS).
Furthermore, it offered to different groups, working with different
methodologies, the possibility to address significant
problems, thus allowing a robust and fair assessment of the
performance of any new method.

Even constraining ourselves to the specific interest of
the present work, it would be difficult to summarize the applications
of ML methods to SDSS data and we shall just
mention two. The search for candidate quasars has been
[...] 2014a). Last but not least, ML methods have been crucial
for the first successful astronomical citizen science project,
Galaxy Zoo (Lintott et al. 2008), where empirical methods
were used to compare and evaluate the performance of a
large number of simple, repetitive and independent human
classification tasks.

In this paper we use photometric and spectroscopic data
extracted from the SDSS Data Release 10 (DR10; Ahn et al.
2014; Eisenstein et al. 2011). The DR10 photometry covers
14,555 deg^2 of the celestial sphere for a total of more
than 469 million unique objects (i.e. without duplicates, overlaps
and repeated measurements), which are not necessarily
unique astrophysical objects. DR10 also includes spectroscopic
information for more than 3 million objects. These
data come from a wide range of concurrent experiments: the
7 initial SDSS data releases; the Sloan Extension for Galactic
Understanding and Exploration (SEGUE-2; Yanny et al.
2009); the Apache Point Observatory Galaxy Evolution Experiment
(APOGEE; Majewski et al. 2014), mainly targeted
at Milky Way stars; the Baryon Oscillation Spectroscopic
Survey (BOSS; Dawson et al. 2013) and, finally, the Multi
Object APO Radial Velocity Exoplanet Large-Area Survey
(MARVELS; Ge et al.).

In order to build our KB we used the spectroscopic
classification flags provided by the SDSS DR10, namely the
spectroscopic classification (parameters CLASS and
SUBCLASS) obtained by the SDSS team through the
comparison of individual spectra with templates and from
equivalent width ratios. This classification is articulated in
three main classes, GALAXY, QSO and STAR, as follows:

GALAXY. Objects having a galaxy template. Its subclasses
are:

• STARFORMING: based on whether the galaxy has
detectable emission lines that are consistent with star
formation according to the criterion:

$\log_{10}(\mathrm{OIII}/\mathrm{H}\beta) < 0.7 - 1.2\,(\log_{10}(\mathrm{NII}/\mathrm{H}\alpha) + 0.4)$

• STARBURST: set if the galaxy is star-forming but has
an equivalent width of H$\alpha$ greater than 50 Å;

• AGN: based on whether the galaxy has detectable emission
lines that are consistent with being a Seyfert or LINER
according to the relation:

$\log_{10}(\mathrm{OIII}/\mathrm{H}\beta) > 0.7 - 1.2\,(\log_{10}(\mathrm{NII}/\mathrm{H}\alpha) + 0.4)$

QSO. Objects identified with a QSO template. Based on
the SDSS definition, if a galaxy or quasar has spectral lines
detected at the 10 sigma level with $\sigma > 200$ km/sec at the
5 sigma level, the indication BROADLINE is appended to
their subclass.

STAR. Objects matching a stellar template (among 26 
spectral subclasses). Since we were not interested in these 
subclasses, they will be ignored in what follows. 
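
The GALAXY subclass criteria quoted above can be sketched in a few lines of Python. This is only an illustration of the two thresholds, not the SDSS pipeline code: the function name, its arguments and the treatment of objects lying exactly on the dividing line are assumptions made here, and the line ratios are assumed to be positive and to come from detected emission lines.

```python
import math


def galaxy_subclass(oiii_hb, nii_ha, ew_halpha):
    """Assign AGN / STARBURST / STARFORMING from emission-line quantities.

    oiii_hb   : OIII/Hbeta line ratio (assumed > 0)
    nii_ha    : NII/Halpha line ratio (assumed > 0)
    ew_halpha : equivalent width of Halpha in Angstrom
    """
    # Dividing line in the (log NII/Halpha, log OIII/Hbeta) plane.
    boundary = 0.7 - 1.2 * (math.log10(nii_ha) + 0.4)
    if math.log10(oiii_hb) > boundary:
        return "AGN"
    # Assumption: objects exactly on the line are treated as star-forming.
    if ew_halpha > 50.0:
        return "STARBURST"
    return "STARFORMING"


if __name__ == "__main__":
    print(galaxy_subclass(1.0, 0.1, 10.0))   # below the line, weak Halpha
    print(galaxy_subclass(1.0, 0.1, 60.0))   # below the line, strong Halpha
    print(galaxy_subclass(10.0, 1.0, 10.0))  # above the line
```

Objects with measurements close to the dividing line change class under small perturbations of the ratios, which is the threshold ambiguity discussed in the text.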

The total number of spectroscopic objects is 3,079,151,
divided into different types as summarized in Table 1. We
assumed all galaxies not flagged otherwise to be normal, i.e.
neither AGN nor starburst.

It is worth noting that sharp cuts based on crisp thresholds
introduce ambiguities in the mapping of the SPS onto
the PPS. More precisely, a large fraction of the objects
with spectroscopic features near the threshold values may
be misclassified (e.g. a galaxy hosting an AGN may be erroneously
put in the normal galaxy bin due to spectroscopic
errors, and vice versa). These ambiguities are intrinsic to the
KB and will affect the projection of the OPS onto the SPS
and therefore onto the PPS.

For all spectroscopic objects we downloaded the sets of
magnitudes listed in the upper part of Table 2. The choice of
using different sets of magnitudes was dictated by the fact
that, since we were also interested in finding AGN candidates,
different apertures can be used to weight in different
ways the contribution of the central unresolved source and
of the extended surrounding galaxy.

We performed a preliminary photometric filtering at
query time, by using the primary mode and by taking
into account the DR10 prescriptions encapsulated by
the flags calibStatus and clean. Beyond these flags, any
use of the produced candidate QSOs should be carefully
investigated in terms of their photometric reliability
(Palanque-Delabrouille et al. 2013).

Finally, the catalogue had to be cleaned of a handful
of objects which appear to be duplicated in the main
SDSS-DR10 archive.


3 THE METHOD 

Supervised machine learning methods used for classification
tasks require an extensive KB formed by objects for which
the outcome of the classification (i.e. the target) is known
a priori. The methods learn from these examples the unknown
and often complex rule linking the input data (in
this case the photometric parameters) to the target (in this
case a physical class known from the KB). Since these methods
cannot be used for extrapolating knowledge outside the
KB ranges, it is crucial to keep in mind that the regions
of the OPS adequately sampled by the KB also define the
domain of applicability of the method.

The KB is usually split in three different subsets to 
be used for training, validation and test, respectively. The 
training set is used by the method to learn the mapping 
function, the validation set is mainly used to avoid the so 
called overfitting, i.e. the loss of generalization capabilities, 
and a blind test set is used to evaluate the performance of 
the method, by exposing the trained network to objects in 
the KB which have never been seen before by the network 
itself. 

Comparing the results with the a priori known values
of the objects in the blind test set allows the evaluation of
a series of statistical indicators and the parametrization of the
performance of the method itself.

An alternative approach, which is also the one used in
this work, is the so-called leave-one-out k-fold cross validation,
which can be implicitly performed during training
(Geisser 1975). The automated cross-validation
process consists of performing k different training runs
with the following procedure: (i) random splitting of the
training set into k random subsets, each one composed of
the same percentage of the data set (depending on the choice
of k); (ii) at each run one subset is excluded, the remaining part
of the data set is used for training and the excluded fraction for validation.
While avoiding overfitting, the k-fold cross validation leads
to an increase of the execution time of ~ k - 1 times the
total number of runs.
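
The splitting procedure just described can be sketched as follows. This is a minimal, standard-library illustration of k-fold index splitting and not the DAMEWARE implementation; the function name `k_fold_splits` and the `seed` argument are hypothetical, and the folds have exactly equal size only when the number of objects is a multiple of k.

```python
import random


def k_fold_splits(n_objects, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross validation."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)      # random splitting of the set
    folds = [indices[i::k] for i in range(k)]  # k subsets of (nearly) equal size
    for i in range(k):
        validation = folds[i]                  # the excluded fraction
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


if __name__ == "__main__":
    for run, (training, validation) in enumerate(k_fold_splits(100, 5)):
        print(run, len(training), len(validation))
```

With k = 5 each run trains on four fifths of the set and validates on the remaining fifth, so every object is used for validation exactly once.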

In all experiments listed in the following section we used
the MLPQNA method and a 5-fold cross-validation.

3.1 The model MLPQNA 

DAMEWARE (DAta Mining & Exploration Web Application
REsource; Brescia et al. 2014) is an infrastructure
which offers to anyone the possibility of engaging in complex
data mining tasks by means of a web-based approach to a
variety of data mining methods. Among the methods available,
for the present work we used a Multi Layer Perceptron
(MLP; Rosenblatt 1961) neural network, which is among the
most used feed-forward neural networks in a large variety
of scientific and social contexts. MLPs may differ widely in
both architecture and learning rules. In this work, we used
the MLPQNA model, i.e. a MLP implementation (Fig. 1)
where the learning rule is based on the Quasi Newton Algorithm
(QNA).

The QNA relies on Newton's method to find the stationary
(i.e. the zero gradient) point of a function; it
is an optimization of the basic Newton learning
rule based on a statistical approximation of the Hessian of
the training error obtained through a cyclic gradient calculation.
MLPQNA makes use of the well known L-BFGS
algorithm (Limited memory - Broyden Fletcher Goldfarb
Shanno; Byrd et al. 1994), which was originally designed for
problems with a very large number of features (hundreds to
thousands).
CLASS    SUBCLASS                 Nr. of obj.    Total Nr.

GALAXY   AGN                           22,589
         BROADLINE                     18,629
         STARBURST                     73,166
         STARFORMING                  265,704
         AGN BROADLINE                  3,505
         STARBURST BROADLINE              166
         STARFORMING BROADLINE          1,539      385,298
         NORMAL                                  1,526,215
         sub-total                               1,911,513

QSO      AGN                              970
         BROADLINE                    257,572
         AGN BROADLINE                  2,954
         STARBURST BROADLINE            9,127
         STARFORMING BROADLINE            552
         STARBURST                        315
         STARFORMING                      113      271,603
         NORMAL                                     97,695
         sub-total                                 369,298

STAR                                               798,340


Table 1. Partition of the SDSS-DR10 spectroscopic objects into spectroscopic classes and subclasses. Column 1: gives the broad spectral 
type; column 2: spectral subtype; column 3: number of objects belonging to that specific subclass; column 4: Total number of objects in 
a given class. 


type                     parameters

Identification           objID, specObjID
Coordinates              RA, DEC
psfMag                   ugriz magnitudes and related errors mag_err
fiberMag                 ugriz magnitudes and related errors mag_err
modelMag                 ugriz magnitudes and related errors mag_err
cmodelMag                ugriz magnitudes and related errors mag_err
deredMag                 ugriz magnitudes
extinction               ugriz extinction values

spectroscopic redshift   z, zWarning
classification           type, class, subclass, flags


Table 2. Description of the parameters (features and targets) used in this work. The first part of the table lists the photometric 
parameters in the OPS, the second one the spectroscopic data and targets. Column 1: collective name of a given set of parameter; 
column 2: the corresponding SDSS parameters. 


The MLPQNA method has been extensively discussed elsewhere
in the contexts of both classification (Brescia et al.
2012) and regression (Brescia et al. 2013; Cavuoti et al.
2015, 2012, 2014b); a first attempt to use MLPQNA for
a similar problem, namely the classification of emission line
galaxies, was presented in Cavuoti et al. (2014a).


4 EXPERIMENTS 

We now describe the main classification experiments which 
were performed. Since the general astronomer may not be 
familiar with the ML methodology and could be confused in 
dealing with the following sections, we wish to emphasize a 
few points. 

Most ML methods are not deterministic and there is
no clear-cut rule to optimize the parameters of the models.
This is particularly true in the astronomical case, due to
classification degeneracy in some parts of the OPS. Thus it
is often not possible to establish a priori which combination
of input parameters is optimal for a given task. This implies
that, in order to find the optimal method, many experiments
with different settings need to be run and evaluated.
A full understanding of the performance of a specific experiment
can be achieved only through the comparison of the
outcomes of a large number of experiments, each run with
different settings.

Our experiments can be divided into three main families:

• two-class experiments for disentangling normal galaxies
from other types (normal - AGN/QSO);

• two-class experiment to disentangle quasars from stars
(QSO - STAR);

• three-class experiments (STAR - GALAXY - QSO).

Before entering into a detailed description of these
three groups of experiments, it is useful to list the criteria
which were adopted in order to reduce the number of
experiments.


                        OUTPUT
                    Class A      Class B

TARGET   Class A    $N_{AA}$     $N_{AB}$
         Class B    $N_{BA}$     $N_{BB}$


Table 3. Confusion matrix for a two-class experiment. $N_{AA}$ is
the number of objects in Class A correctly classified; $N_{AB}$:
number of objects belonging to Class A erroneously classified in
Class B; $N_{BA}$: number of objects belonging to Class B erroneously
classified as belonging to Class A; $N_{BB}$: number of objects in Class
B which were correctly classified.



Input features and targets. As input features we used
the 5 psfMag and the 5 modelMag magnitudes provided
by the SDSS in the u, g, r, i, z photometric system. As target
vectors we used the spectroscopic taxonomy provided
by SDSS and discussed in Sec. 2. Following a similar approach
but using a different ML method, Abraham et al.
(2012) found that better performance can be obtained by
using 10 colors, obtained from all possible combinations of
the SDSS psfMag magnitudes, plus the u band PSF magnitude.
Although the information in 10 colors is in principle
degenerate (i.e., could be described with just 5 magnitudes),
in order to perform a direct comparison with Abraham et al.
(2012), we derived 10 colors from the psfMag. Furthermore,
we also performed a few experiments on a subregion of the
OPS defined by a cut in the i magnitude, which was forced
to fall in the range [14, 24], plus the following cuts in colors:

• $-0.25 < u - g \le +1.00$;

• $-0.25 < g - r \le +0.75$;

• $+0.30 < r - i \le +0.50$;

• $-0.30 < i - z < +0.50$.
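
As an illustration of the feature construction described above, the 10 colors are simply all pairwise differences of the five psfMag magnitudes. The sketch below assumes a plain dictionary of magnitudes per object; the names `BANDS`, `psf_colors` and `in_i_range` are hypothetical helpers, not SDSS or DAMEWARE identifiers, and whether the ends of the [14, 24] interval are included is an assumption.

```python
from itertools import combinations

BANDS = ("u", "g", "r", "i", "z")


def psf_colors(mags):
    """Return the 10 colors built from all pairs of the five magnitudes.

    mags: dict mapping band name -> psfMag value.
    """
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in combinations(BANDS, 2)}


def in_i_range(mags, lo=14.0, hi=24.0):
    """True if the i magnitude falls in [lo, hi] (inclusive by assumption)."""
    return lo <= mags["i"] <= hi


if __name__ == "__main__":
    example = {"u": 20.0, "g": 19.5, "r": 19.0, "i": 18.75, "z": 18.5}
    colors = psf_colors(example)
    print(len(colors), colors["u-g"], colors["u-r"])
    print(in_i_range(example))
```

The degeneracy mentioned in the text is visible here: u-r equals (u-g) + (g-r), so only four of the ten colors are independent.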


The performance of the experiments was evaluated in terms of
percentages of the standard statistical indicators: overall efficiency,
completeness, purity and contamination, defined as follows with
reference to the confusion matrix in Table 3:


• Overall Efficiency $e_{tot}$: defined as the ratio between the
number of correctly classified objects and the total number
of objects in the data set. With reference to Table 3:

$e_{tot} = \frac{N_{AA} + N_{BB}}{N_{AA} + N_{AB} + N_{BA} + N_{BB}}$   (1)

• Completeness $C_i$: defined as the ratio between the number
of correctly classified objects in a class and the total
number of objects of the same class present in the data set.
For instance, for class A:

$C_A = \frac{N_{AA}}{N_{AA} + N_{AB}}$   (2)

• Purity $P_i$: defined as the ratio between the number of
correctly classified objects in a class and the total number
of objects classified in that class. For instance, for class A:

$P_A = \frac{N_{AA}}{N_{AA} + N_{BA}}$   (3)

• Contamination $Co_i$: defined as the dual of purity,
namely as the ratio between the misclassified objects in a
given class and the number of objects classified in that class.
With reference to class A:

$Co_A = 1 - P_A = \frac{N_{BA}}{N_{AA} + N_{BA}}$   (4)


Figure 1. The typical topology of a feed-forward neural network,
in this case representing the architecture of MLPQNA. In the
example there are two hidden layers between the input (X) and
output (Y) layers, corresponding to the architecture mostly used
in the case of complex problems.


KB     step        Normal     Others     total

ALL    Training    77,692     76,970     154,662
       Test        311,236    308,323    619,559

CUTS   Training    77,004     75,915     152,919
       Test        308,512    304,488    613,300


Table 4. Characteristics of the KBs used to perform the Normal
galaxies vs others experiments (Sec. 4.1).

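
The four indicators defined in equations (1) to (4) follow directly from the entries of the two-class confusion matrix of Table 3. The sketch below is only an illustration, with a hypothetical function name and made-up counts that do not come from any experiment in this work; it returns fractions, which can be multiplied by 100 to obtain percentages.

```python
def two_class_indicators(n_aa, n_ab, n_ba, n_bb):
    """Return efficiency, completeness, purity and contamination for class A.

    n_aa: class A objects classified as A    n_ab: class A objects classified as B
    n_ba: class B objects classified as A    n_bb: class B objects classified as B
    """
    total = n_aa + n_ab + n_ba + n_bb
    purity_a = n_aa / (n_aa + n_ba)
    return {
        "efficiency": (n_aa + n_bb) / total,      # equation (1)
        "completeness_A": n_aa / (n_aa + n_ab),   # equation (2)
        "purity_A": purity_a,                     # equation (3)
        "contamination_A": 1.0 - purity_a,        # equation (4)
    }


if __name__ == "__main__":
    # Illustrative counts only.
    for name, value in two_class_indicators(90, 10, 5, 95).items():
        print(f"{name}: {100 * value:.2f}%")
```

The same function gives the class B indicators when the roles of the two classes are swapped, i.e. by calling it with the arguments in the order (n_bb, n_ba, n_ab, n_aa).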


Finally, all experiments were performed using a 2-layer
MLPQNA model, i.e. a complex feed-forward architecture
including two hidden layers of neurons, besides the input
and output layers (Fig. 1).


4.1 Two-class experiment: normal galaxies vs 
others 

As mentioned before, a first set of experiments was performed
on the galaxy subset of the SDSS DR10 catalogue.
The main goal of these experiments was to evaluate whether
optical photometry could be used to distinguish between
normal galaxies and galaxies with a peculiar spectrum. This
is particularly important for the new surveys, which are expected
to expand the number of known AGNs, by probing
them rather than QSOs only.

We therefore built a KB, based on the 10 magnitudes
(psfMag and modelMag) available in all SDSS bands, using
the objects listed in the upper part of Table 1, with an
additional flag, set to 0 for the normal galaxies and 1 for all
other types.

With reference to Table 1, we see that in the whole KB
there are 1,911,513 objects, among which 1,526,215 are normal
galaxies and 385,298 belong to the other classes.
In order to equally weight the two groups, we randomly extracted
~ 25% of the galaxies, obtaining a smaller data set.
We also cleaned the KB by removing Not-a-Number (NaN)
entries. The resulting data set (hereafter named ALL) contained
619,559 objects, among which 311,236 were flagged
as normal galaxies. Furthermore, in order to exclude objects
falling in the poorly populated parts of the OPS, we also created
a second data set by applying some cuts in magnitudes,
i.e. by removing the very low significance tails of the magnitude
distributions, consisting of ~ 1% of the original data.
This second KB will be referred to as CUTS in what follows.
The properties of the two different KBs and the way they
were split into training and test sets are listed in Table 4.


Experiment   % e_tot   % C_normal   % P_normal   % C_others   % P_others

ALL          89.50     90.20        89.03        88.79        89.97
CUTS         89.59     90.44        89.05        88.73        90.15


Table 5. Summary of the performance on the blind test sets for the MLPQNA experiments on the ALL and CUTS KBs. The values
are calculated with equations (1) to (3).


The main results of the classification with the
MLPQNA model are summarised in Table 5, while Tables
6 and 7 list, respectively, the misclassified objects for
the ALL and for the CUTS experiments, divided according
to the SDSS spectroscopic subclasses.

The first thing to be noted from the results of Table 5 is
that the performance of the MLPQNA on the two different
data sets differs only by a negligible amount, although with
a slightly better behavior in the CUTS. Nonetheless we considered
the ALL experiment as the best one, because it does
not require any restriction on the OPS. This induced us to
exclude any cut in further experiments.

Looking at Tables 6 and 7, which summarize the results
of these experiments, one notices that all objects labeled as
broadline are poorly classified (with contamination ranging
from 69% to 30%). This fact induced us to check whether a
different partition of the spectroscopic subclasses could lead
to some improvements in the performance.

We therefore performed two additional experiments on
the ALL data set: in the first case by grouping the objects
in two broad classes, NORMAL galaxies plus the BROADLINE
type vs others (hereafter NBonly experiment); in the second
case by grouping NORMAL galaxies together with all the
BROADLINE types vs others (hereafter NBall experiment).

Finally, again in order to minimize the number of misclassified
objects, an additional two-class classification experiment
was performed by splitting the data set in two
classes: one containing NORMAL, BROADLINE, AGN and
AGN BROADLINE, and the other including all remaining
types (hereafter experiment NBA).
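
The different partitions of the spectroscopic subclasses used in these two-class experiments can be summarized as a simple lookup, following the groupings listed in Tables 8 to 10. The sketch below is illustrative only: the dictionary and function names are hypothetical, and extending the 0/1 flag convention of the ALL experiment to the regrouped experiments is an assumption.

```python
# Subclasses assigned to the first class (flag 0) in each experiment;
# every other galaxy subclass goes to OTHERS (flag 1).
EXPERIMENT_GROUPS = {
    "ALL": {"NORMAL"},
    "NBonly": {"NORMAL", "BROADLINE"},
    "NBall": {
        "NORMAL",
        "BROADLINE",
        "AGN BROADLINE",
        "STARBURST BROADLINE",
        "STARFORMING BROADLINE",
    },
    "NBA": {"NORMAL", "BROADLINE", "AGN", "AGN BROADLINE"},
}


def binary_flag(subclass, experiment):
    """Return 0 if the subclass belongs to the first class, 1 otherwise."""
    return 0 if subclass in EXPERIMENT_GROUPS[experiment] else 1


if __name__ == "__main__":
    for experiment in EXPERIMENT_GROUPS:
        print(experiment, binary_flag("AGN", experiment))
```

Keeping the grouping in one table makes it easy to rerun the same KB with a different partition without touching the rest of the pipeline.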

The results for the three experiments are summarized
in Table 11, while the percentages of misclassified objects in
the three cases are reported, respectively, in Table 8 for
NBonly, Table 9 for NBall and Table 10 for NBA.

4.2 Two-class experiment: QSO vs Star 

The identification of quasar candidates on photometric
grounds is a topic of the highest relevance. In order to
evaluate the best possible combination of input features
and model architecture, we ran several experiments. Given
the large computational load, it was not advisable
to run the experiments on the whole KB and therefore,
for these exploratory experiments, we used a reduced KB
randomly extracted from the original one. Furthermore, in
terms of taxonomy of all the combinations among magnitudes
and colors, we were also interested in evaluating performance
in comparison with the experiment types described
in Abraham et al. (2012). Hence, only colors have been used.
The performed experiments comprised the
following cases:

a) two experiments with 4 colors and one magnitude, respectively
u and r (hereinafter named, respectively, 2a-u
and 2a-r);

b) as with 2a, but with all the same cuts in magnitudes and
colors as in Abraham et al. (2012) (hereinafter named,
respectively, 2b-u and 2b-r);

c) as with 2a, but with only a cut in the i magnitude,
taking into account what was done by Abraham et al.
(2012) (hereinafter named, respectively, 2c-u and 2c-r);

d) as with 2a, but using 10 colors, obtained by combinations
of psfMag magnitudes (hereinafter named, respectively,
2d-u and 2d-r);

e) as with 2b, extended to 10 colors (hereinafter named,
respectively, 2e-u and 2e-r);

f) as with 2c, extended to 10 colors (hereinafter named,
respectively, 2f-u and 2f-r).

The results of the experiments are reported in Table 12.
In statistical terms, the experiments with the i cut turned out to be
comparable to those without any cut. Hence, we considered
the latter group as the best. These results will be further
discussed in Sect. 5.

4.3 Three-class experiments 

This set of experiments aimed at reproducing on photometric
grounds the SDSS spectroscopic classification in the three
main classes (STAR, GALAXY, QSO). We therefore performed
our three-class experiments using:

a) 10 magnitudes (hereinafter named as 3a), composed by 
the five magModel plus the five psfMag, without any pho¬ 
tometric cut, in order to properly weight the contribution 
from the nuclear regions; 

b) 10 magnitudes composed by the five magModel and 
the five psfMag with a cut in the i magnitude (hereinafter 
named as 3b); 

c) 10 colors from psfMag type and a magModel magni¬ 
tude, respectively, u and r (hereinafter named as, respec¬ 
tively, 3c-u and 3c-r); 

d) 5 magnitudes, alternately from psfMag and magModel 
(hereinafter named respectively as 3d-psf and 3d-model), in 
order to evaluate the single contribution of the two types. 

In all cases the training set was randomly extracted by
balancing the number of examples presented to the network
for each class, while the given dataset was always randomly
split into a training and a blind test set, using
percentages of, respectively, 12% and 88%.
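
One possible way to obtain a class-balanced training set together with a 12%/88% split is sketched below. The text does not specify how the balancing was actually implemented, so this is an assumption-laden illustration: the function name `balanced_split`, the equal per-class quota and the `seed` argument are all hypothetical.

```python
import random


def balanced_split(labels, train_fraction=0.12, seed=0):
    """Split object indices into a class-balanced training set and a test set.

    labels: list of class labels, one per object.
    Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Same number of training examples per class (assumed balancing scheme),
    # limited by the size of the least populated class.
    n_train = round(train_fraction * len(labels))
    per_class = min(n_train // len(by_class),
                    min(len(members) for members in by_class.values()))

    train = []
    for members in by_class.values():
        train.extend(rng.sample(members, per_class))
    train_set = set(train)
    test = [idx for idx in range(len(labels)) if idx not in train_set]
    return sorted(train), test


if __name__ == "__main__":
    labels = ["STAR"] * 500 + ["GALAXY"] * 300 + ["QSO"] * 200
    train, test = balanced_split(labels)
    print(len(train), len(test))
```

Note that with this scheme the blind test set keeps the original class proportions apart from the objects removed for training, so it is not itself balanced.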

The results of the three-class experiments are reported
in Table 13. By looking at the QSO statistics, the
Class    Subclass            N. objects   N. objects   misclassified
                             (subclass)   (class)      %

NORMAL                                    311,236      10

OTHERS   AGN                 18,119       308,323      41
         BROADLINE           14,936                    67
         STARBURST           58,371                     2
         STARFORMING         212,748                    7
         AGN BROADLINE       2,798                     47
         STARBURST BROAD     133                       14
         STARFORMING BROAD   1,218                     30


Table 6. Summary of misclassified objects in the ALL experiment. Column 1: SDSS spectroscopic class; column 2: SDSS spectroscopic
subclass; columns 3 and 4: number of objects in the subclass and in the class; column 5: fraction of misclassified objects.


Class    Subclass            N. objects   N. objects   misclassified
                             (subclass)   (class)      %

NORMAL                                    308,512      10

OTHERS   AGN                 17,971       304,488      41
         BROADLINE           14,356                    69
         STARBURST           56,922                     2
         STARFORMING         211,238                    7
         AGN BROADLINE       2,688                     48
         STARBURST BROAD     132                       15
         STARFORMING BROAD   1,181                     31


Table 7. Summary of misclassified objects in the CUTS experiment. Columns as in Table 6.


Class          Subclass             N. objects    N. objects    misclassified 
                                    (subclass)    (class)       % 

NORMAL         NORMAL               278,213       292,850       9 
+ BROADLINE    BROADLINE            14,637                      9 

OTHERS         AGN                  17,750        285,104       46 
               STARBURST            56,443                      2 
               STARFORMING          206,853                     7 
               AGN BROADLINE        2,715                       67 
               STARBURST BROAD      133                         17 
               STARFORMING BROAD    1,210                       41 

Table 8. Summary of misclassified objects in the NBonly experiment. Columns as in Table 6. 


Class          Subclass             N. objects    N. objects    misclassified 
                                    (subclass)    (class)       % 

NORMAL         NORMAL               277,977       296,691       9 
+ ALL          BROADLINE            14,626                      7 
BROADLINE      AGN BROADLINE        2,764                       31 
               STARBURST BROAD      127                         84 
               STARFORMING BROAD    1,197                       51 

OTHERS         AGN                  17,633        280,657       48 
               STARBURST            56,484                      2 
               STARFORMING          206,540                     7 

Table 9. Summary of misclassified objects in the NBall experiment. Columns as in Table 6. 

Class          Subclass             N. objects    N. objects    misclassified 
                                    (subclass)    (class)       % 

NORMAL         NORMAL               236,847       272,700       8 
+ BROADLINE    BROADLINE            14,356                      5 
+ AGN          AGN                  18,119                      43 
+ AGN BROAD    AGN BROADLINE        2,798                       23 

OTHERS         STARBURST            58,371        272,470       2 
               STARFORMING          212,748                     9 
               STARBURST BROAD      133                         9 
               STARFORMING BROAD    1,218                       54 

Table 10. Summary of misclassified objects in the NBA experiment. Columns as in Table 6. 


Experiment    % e_tot    class                       % completeness    % purity 

NBonly        91.03      NORMAL+BROADLINE            91.25             91.06 
                         OTHERS                      90.80             90.99 
NBall         91.11      NORMAL+ALL BROADLINE        91.07             91.57 
                         OTHERS                      91.14             90.62 
NBA           91.17      NORMAL+BROADLINE+ALL AGN    89.88             92.27 
                         OTHERS                      92.46             90.13 

Table 11. Summary of the statistical performance on the two-class experiments named NBonly, NBall and NBA. Columns 2 and 
5 are calculated with equations 1 to 3. 


EXP ID    % e_tot    Star %C_S    Star %P_S    QSO %C_Q    QSO %P_Q 

2a-u      91.24      87.76        95.23        95.11       87.48 
2a-r      91.30      87.92        95.18        95.05       87.63 
2b-u      88.06      78.71        98.72        98.04       81.19 
2b-r      88.13      79.73        98.92        98.21       81.96 
2c-u      91.22      87.20        95.68        95.56       87.13 
2c-r      91.15      86.25        96.47        96.53       86.47 
2d-u      91.34      88.09        95.06        94.93       87.81 
2d-r*     91.42      88.08        95.23        95.11       87.82 
2e-u      89.15      80.52        98.39        98.56       82.28 
2e-r      89.05      80.10        98.61        98.77       82.05 
2f-u      91.42      88.49        95.23        94.83       87.61 
2f-r      91.43      88.67        95.16        94.68       87.63 

Table 12. Summary of the results of the two-class experiments (from Sec. 4.2). The best one is marked with an asterisk (referenced 
in the text as 2d-r) and refers to the parameter space composed of the 10 colors without any cut for each object. The training and test 
sets are respectively 10% and 90% of the given dataset. The last five columns refer to equations 1 to 3. All the quantities 
reported in the table are percentages. 


EXP ID      % e_tot    Galaxy %C_G    Galaxy %P_G    QSO %C_Q    QSO %P_Q    Star %C_S    Star %P_S 

3a*         91.31      97.02          93.49          90.49       86.90       86.40        93.82 
3b          91.02      96.96          93.49          88.95       86.84       87.19        92.89 
3c-u        87.83      92.69          88.00          88.27       85.56       82.57        90.21 
3c-r        87.77      92.64          88.03          88.42       85.60       82.27        89.93 
3d-psf      86.62      90.73          86.58          87.94       83.28       81.23        90.65 
3d-model    87.64      93.60          87.32          88.13       85.76       81.23        90.17 

Table 13. Summary of the results of the three-class experiments (from Sec. 4.3). The best one is marked with an asterisk (referenced in the 
text as 3a) and refers to the parameter space composed of the 10 magnitudes (psfMag and magModel) for each object. The training 
and test sets are respectively 12% and 88% of the given dataset. The columns refer to equations 1 to 3. All the quantities 
reported in the table are percentages. 

purity/completeness results of the 3a experiment appear very 
similar to those obtained in the 3b experiment. Although 
these two experiments led to comparable results, we consider 
3a the best, because it was obtained without any cut 
in the parameters. For a complete discussion see Sect. 5.3. 


5 DISCUSSION 

The classification experiments described in the previous 
sections aimed at producing a reliable catalogue of candidate 
QSOs/AGNs by exploiting the photometric information 
available within the DR10. In what follows we give a 
detailed overview of their results. 

5.1 Galaxy versus others 

The set of experiments described in Sec. 4.1 aimed at 
investigating whether optical photometry could be used to isolate 
different types of galaxies as defined by the spectroscopic 
subclasses in the SDSS DR10 spectroscopic archive. With 
reference to Table 1, it is apparent that the different types of 
objects are represented in the KB in a very uneven way. In 
the first pair of experiments (ALL and CUTS) we tried to 
separate normal galaxies from those having a peculiar 
spectrum (i.e. all the other subclasses in the SDSS classification 
scheme). The performance of the method, shown in Table 5, 
was not affected by the application of color and magnitude 
cuts, and the efficiency remained stable at ~ 90%. Given 
the lack of improvement between the ALL and CUTS 
experiments, we decided to perform the supplementary 
experiments by keeping fixed the ALL KB, the network 
configuration and the input parameters. 

The statistics of misclassified objects, reported in 
Tables 6 and 7, show that MLPQNA is quite effective in 
disentangling starburst and starforming galaxies, both 
narrow and broadline (misclassification rates of 2% and 7%, 
respectively), and much less effective in identifying other types 
of objects such as, for instance, AGN (misclassification rate 
41%) and AGN broadline objects (~ 47%). The performance 
on the starburst broadline and starforming broadline types, 
however, is difficult to evaluate, since these two groups are 
very poorly represented in the KB and therefore the 
network may have failed to capture their specific signatures in 
the OPS. 

The fact that 41% of the AGN, 67% of the broadline 
and 47% of the AGN broadline objects were misclassified 
prompted us to check whether different groupings of the 
subclasses could lead to improvements in the classification 
efficiency. 

In the NBonly experiment (Table 8) the misclassification 
rate improved to 9% for the normal galaxies and 
dropped to 9% (from 67%) for the broadline. The 
misclassification rates for AGN broadline, starburst broadline and 
starforming broadline increased, confirming that the 
network is not capable of disentangling these spectroscopic 
subclasses in an effective way. In the NBall experiment 
(Table 9) the results did not change in a significant way. In 
fact, the normal, starburst and starforming type objects 
still maintain a misclassification rate of, respectively, 9%, 
2% and 7%, while for the broadline galaxies the 
misclassification decreases to 7%. A slightly better result is obtained 
for the AGN broadline type, going from 47% in ALL and 
67% in NBonly, to 31%. The AGN and starforming 
broadline types perform even worse, reaching a misclassification 
of, respectively, 48% and 51%. Finally, the starburst 
broadline type, although sparsely represented in the data set, 
was almost completely misclassified (84%). 

The last experiment (NBA) turned out to be the best one 
in terms of overall efficiency as well as for the normal, 
broadline, AGN broadline and starburst broadline classes, where 
the misclassification rates decreased to, respectively, 8%, 
5%, 23% and 9%. Only the starburst rate remains 
unchanged (2%), while AGN objects perform slightly worse 
(43% vs 41% obtained in the ALL case). On the other hand, 
the classification of the starforming and starforming broadline 
types worsens, with misclassification rates rising to, 
respectively, 9% and 54%, although the starforming rate 
changes by 2% only. The strong degradation for the 
starforming broadline type could be due to its poor 
representation within the data set. 
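
The groupings used in the five galaxy experiments, as they can be read from Tables 6 to 10, can be summarised by the following sketch. The label convention (0 for the first class, 1 for OTHERS) and the function names are illustrative assumptions, not part of the original code.

```python
from collections import Counter

BROAD_TYPES = {"BROADLINE", "AGN BROADLINE",
               "STARBURST BROAD", "STARFORMING BROAD"}

# Subclasses assigned to the first class in each experiment;
# every other subclass goes into the OTHERS class.
FIRST_CLASS = {
    "ALL": {"NORMAL"},
    "CUTS": {"NORMAL"},
    "NBonly": {"NORMAL", "BROADLINE"},
    "NBall": {"NORMAL"} | BROAD_TYPES,
    "NBA": {"NORMAL", "BROADLINE", "AGN", "AGN BROADLINE"},
}


def binary_label(subclass, experiment):
    """Return 0 for the first class of the experiment, 1 for OTHERS."""
    return 0 if subclass in FIRST_CLASS[experiment] else 1


def misclassification_by_subclass(subclasses, predicted, experiment):
    """Percentage of misclassified objects for each spectroscopic subclass."""
    total, wrong = Counter(), Counter()
    for sub, pred in zip(subclasses, predicted):
        total[sub] += 1
        if pred != binary_label(sub, experiment):
            wrong[sub] += 1
    return {sub: 100.0 * wrong[sub] / total[sub] for sub in total}
```

Given the spectroscopic subclass and the predicted binary label of each test object, the second function returns the kind of per-subclass percentages listed in the last column of Tables 6 to 10.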

5.2 QSO versus Stars 

Already in the earliest SDSS works (Stoughton et al. 2002; 
Richards et al. 2001) optical colors were used to disentangle 
stars from quasars in the SDSS color space. D'Abrusco et al. 
(2009) used an unsupervised clustering algorithm followed 
by an agglomeration phase to identify candidate QSOs in a 
parameter space based only on photometric colors. 

Sinha et al. (2007) and Abraham et al. (2012) used a 
Difference Boosting Neural Network (DBNN) on SDSS 
optical colors only, achieving excellent performance (~ 98%) in 
disentangling QSOs from normal galaxies in the unresolved 
source catalogues (thus including stars, quasars and 
unresolved galaxies). They however applied a cut in colors by 
isolating the region of this space where most (84%) of the 
spectroscopically confirmed quasars lie. This choice 
penalizes the recognition of peculiar objects, which are 
likely to lie outside the main distribution of normal, 
well-behaved objects. 

Our experiments are described in Sec. 4.2 and the related 
results are shown in Table 12. The immediate conclusion 
which can be drawn from the results is that our 
method is able to achieve a completeness of ~ 95% and a 
purity of ~ 88% without having to apply any color cuts. 

5.3 Star - Galaxy - QSO 

In the three-class classification experiment the KB included 
all objects in the spectroscopic catalogue, regardless of their 
extended or unresolved nature. 

The first thing to notice is that all the experiments reported 
in Table 13 have performances which differ only by a small 
amount. The experiment which led to the best results 
was 3a, with an overall efficiency of ~ 91%. This 
experiment was performed using a two-layer MLPQNA and a 
quite small training set (only 12% of the available KB), 
using as input parameters the 10 SDSS magnitudes. Having 
identified experiment 3a as the best one, two additional 
experiments were performed on the 3a parameter space, by 
changing either the topology of the neural network or the 
size of the training set, in order to evaluate their individual 
contribution to the performance. In the first case we used a 
one-layer MLPQNA and in the second case a larger training set 
(60% of the KB). The case with the increased training set 
led to a negligible improvement in the overall efficiency, but 
at the price of a huge increase in computing time, while 
in the case of a single-layer MLPQNA network the 
performance decreased by about 1%. This can be understood 
by taking into account that the complexity introduced by 
the second hidden layer in the architecture of the neural 
network is always justified when the KB samples the OPS 
only sparsely. In the case discussed here, even with a 
reduced KB, the number of samples in the training set was 
sufficiently high to ensure a proper coverage of the OPS and 
to minimize the requirements on generalization capabilities. 
The additional complexity introduced by the second layer 
proves indeed very useful in the presence of much smaller KBs 
(Cavuoti et al. 2012). Surprisingly enough, the introduction 
of cuts in the i magnitude (experiment 3b) did not 
lead to any significant improvement in the method efficiency. 

In summary, using optical data only, MLPQNA 
seems quite effective in separating galaxies (completeness 
97.02% and contamination 6.51%) from QSOs and stars 
(Table 13). 
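
The statistical indicators quoted throughout this section can be derived from a confusion matrix as in the sketch below. The definitions used here (efficiency as the fraction of correctly classified objects, completeness as the fraction of true members recovered, purity as the fraction of assigned objects that are true members) are the standard ones and are assumed to match the equations referenced in the table captions; contamination is taken as 100% minus purity, consistent with the galaxy values quoted above (100 - 93.49 = 6.51). The data layout is hypothetical.

```python
def classification_statistics(confusion):
    """Compute efficiency, completeness, purity and contamination (in %).

    confusion[t][p] is the number of test objects whose true class is t
    and whose assigned class is p (hypothetical layout).
    """
    classes = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in classes)
    stats = {"efficiency": 100.0 * correct / total}
    for c in classes:
        n_true = sum(confusion[c].values())                  # true members
        n_assigned = sum(confusion[t][c] for t in classes)   # assigned to c
        purity = 100.0 * confusion[c][c] / n_assigned
        stats[c] = {
            "completeness": 100.0 * confusion[c][c] / n_true,
            "purity": purity,
            "contamination": 100.0 - purity,
        }
    return stats
```

The function works for any number of classes, so the same sketch applies to the two-class and three-class experiments.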

We performed the supervised learning classification 
experiment to produce a catalogue of high-fidelity candidate 
QSOs selected on photometric data only. The classification is 
based on a membership probability criterion: for each 
object the model output consists of three different 
membership pseudo-probabilities (a confidence level for each class). 
Therefore the values in Table 13 refer to these 
confidence levels for all objects of the blind test set, each object 
being assigned to the class with the highest confidence (a 
technique also known as winner-takes-all; Grossberg 1973). With this 
method, in the case of the selected best experiment, we 
reached a QSO class purity of ~ 87%. Afterwards, since our 
goal was to reach the highest level of purity in the produced 
catalogue, we performed a further statistical analysis of the 
test set starting from the results of experiment 3a, by 
assessing the variation of purity vs completeness as a function 
of an increasing confidence threshold used to select the 
QSO candidates from the trained MLPQNA model output. 
In the end we reached a purity of ~ 95% in the blind test 
set, at the price of a reduced completeness. The resulting 
QSO photometric catalogue contains 3,602,210 candidates 
and will be made publicly available through the CDS VizieR 
facility. 
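
The trade-off between purity and completeness as a function of the confidence threshold can be explored with a sketch like the following. It assumes that the model output for each object is available as a dictionary of three pseudo-probabilities and that an object is kept as a QSO candidate only if QSO is the winning class and its confidence is at least equal to the threshold; the names and the data layout are hypothetical.

```python
def winner_takes_all(output):
    """Return the class with the highest pseudo-probability."""
    return max(output, key=output.get)


def qso_purity_completeness(outputs, truths, threshold):
    """QSO purity and completeness (in %) at a given confidence threshold.

    outputs: list of dicts such as {"GALAXY": 0.1, "QSO": 0.8, "STAR": 0.1}
    truths:  list of spectroscopic classes, in the same order
    """
    selected = [truth for out, truth in zip(outputs, truths)
                if winner_takes_all(out) == "QSO" and out["QSO"] >= threshold]
    n_true_qso = sum(1 for truth in truths if truth == "QSO")
    n_hit = sum(1 for truth in selected if truth == "QSO")
    purity = 100.0 * n_hit / len(selected) if selected else float("nan")
    completeness = 100.0 * n_hit / n_true_qso if n_true_qso else float("nan")
    return purity, completeness
```

Evaluating the function on the blind test set for a grid of increasing thresholds gives the purity vs completeness curve from which a working threshold can be chosen.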

We wish to emphasize that a three-class classification 
approach offers a significant advantage with respect to the 
traditional approach based on a sequence of two-class 
classification steps, since it does not introduce systematic 
biases due to objects which are misclassified at each step of 
the sequence. Specifically, the separation among stars, 
galaxies and quasars could be achieved via two consecutive 
classification steps: first separating resolved objects (galaxies) 
from unresolved ones (stars and quasars), and then dividing the 
latter group into stars and quasars. Objects misclassified at 
the first step would be propagated into the next step. 
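
The effect of this propagation can be illustrated with a short calculation. The sketch below assumes, for simplicity, that the steps of the sequence err independently, so that the fraction of objects surviving the whole chain correctly is the product of the per-step accuracies; the numbers in the usage note are purely illustrative and are not measured values.

```python
def chained_success(step_accuracies):
    """Fraction of objects classified correctly at every step of a sequence.

    Assumption: the steps err independently, so the per-step accuracies
    (given as fractions between 0 and 1) simply multiply.
    """
    result = 1.0
    for accuracy in step_accuracies:
        result *= accuracy
    return result
```

For instance, two consecutive steps that are each 90% accurate would deliver `chained_success([0.9, 0.9])`, i.e. about 81% of the objects correctly through both steps.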

Restricting ourselves to the QSO vs STAR classification, 
in the three-class experiment we achieve an overall 
accuracy of 91.5%, with a QSO purity of 88.4% and a QSO 
completeness of 95.3%. Therefore, by comparing this result 
with those of the two-class experiment (Table 12), there is 
a noticeable improvement of accuracy in the three-class 
experiment. This confirms that a three-class approach 
is preferable to the hierarchical chain of two-class 
experiments. The slight decrease in QSO completeness found in 
the three-class experiment may simply be due to the higher 
separation complexity of the experiment. 

It needs to be stressed, however, that the catalogue of 
candidate quasars needs to be used with some caution. In 
fact, no a priori defined photometric cuts have been applied 
to the data, and our operative definition of a quasar was 
induced only by the properties of the spectroscopic knowledge 
base. Hence any bias potentially present in the KB would 
be reflected in the final definition of the catalogue. 

Let us, for instance, take into account the bright end of 
the luminosity distribution. It is well known that the SDSS 
quasar catalogue is fairly (~ 95%) complete for i < 19.1 
(Winchatz & Anderson 2007; Ross et al. 2012). If we 
consider the produced catalogue, within the flux limit of i < 
19.1 we find ~ 50% more candidates than those present in 
the spectroscopic sample. But, as already discussed in the 
literature (Croom et al. 2009), this bright end is strongly 
contaminated by stars, UVX sources and, mainly, by narrow 
emission-line galaxies. In order to isolate luminous quasars, 
Croom et al. (2009) used a complex system of photometric 
cuts: 

A) u-g < 0.8 ∧ g-r < 0.6 ∧ r-i < 0.6; 

B) u-g > 0.6 ∧ g-i > 0.2; 

C) u-g > 0.45 ∧ g-i > 0.35; 

D) galprob > 0.99 ∧ u-g > 0.2 ∧ g-r > 0.25 ∧ r-i < 0.3; 

E) galprob > 0.99 ∧ u-g > 0.45. 

These cuts were then combined through the logical 
expression [A ∧ B ∧ C ∧ D ∧ E], to carve the optimal quasar 
locus in the observed parameter space. 
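
Readers who wish to apply their own filtering may find the following sketch useful. It evaluates the five conditions A to E for a single object, assuming that the connective inside each condition is a logical AND; the function name and the argument names are hypothetical. How the five conditions are then combined into the final selection follows the logical expression quoted in the text and is left to the caller.

```python
def croom_conditions(u, g, r, i, galprob):
    """Evaluate the five photometric conditions A-E for one object.

    u, g, r, i: magnitudes of the object in the SDSS bands
    galprob:    probability that the object is a galaxy
    Returns a dict mapping the condition letter to True or False.
    """
    ug, gr, ri, gi = u - g, g - r, r - i, g - i
    return {
        "A": ug < 0.8 and gr < 0.6 and ri < 0.6,
        "B": ug > 0.6 and gi > 0.2,
        "C": ug > 0.45 and gi > 0.35,
        "D": galprob > 0.99 and ug > 0.2 and gr > 0.25 and ri < 0.3,
        "E": galprob > 0.99 and ug > 0.45,
    }
```

Keeping the five flags separate, rather than returning a single boolean, makes it easy to inspect which condition accepts or rejects a given candidate.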

If we apply the above conditions to both the spectroscopic 
KB and the final catalogue of candidate quasars, under the 
flux limit of i < 19.1 the overabundance of objects in the 
produced catalogue is reduced by ~ 17%. 

In other words, while ML methods prove very effective 
in partitioning the OPS according to the information 
contained in the spectroscopic KB, the correct interpretation of 
their output needs to be fine-tuned using the expert's (i.e. 
the astronomer's) knowledge. For this reason, in order to 
allow interested readers to apply their own filters to the data, 
the catalogue contains all the relevant photometric 
information. 

In Fig. 2 we compare the number of candidate QSOs 
(solid black line) in our catalogue with the number of objects 
in the KB which are spectroscopically identified as QSOs 
(solid grey line), both before and after (dashed lines) the 
cuts of Croom et al. (2009). Before the cuts, the excess of 
spectroscopically confirmed AGNs/QSOs at bright 
magnitudes (i < 17) can be easily explained by the fact that low 
luminosity AGNs, which are spectroscopically identified, can 
easily escape photometric detection, since the relative weight 
of the AGN contribution is negligible with respect to the 
contribution of the central regions of the galaxy. 
Furthermore, in this magnitude range the SDSS sample is highly 
unbalanced, thus causing the MLPQNA model to be less 
accurate in terms of training performance on the QSO class. 
After applying the Croom et al. (2009) selection criteria, 
however, the agreement between the two distributions becomes 
remarkably good and it is maintained down to the 
completeness limit of the SDSS spectroscopic sample (i = 19.1). 
Beyond such limit, the two distributions begin to differ due 

Figure 2. Logarithmic distribution of the number of objects per square degree as a function of the psfMag_i magnitude (x axis: i band, psfMag), before (solid 
line) and after (dashed line) the quality cut specified by Croom et al. (2009). Black lines refer to our catalogue, gray lines to the SDSS 
spectroscopic sample. The solid black line represents the resulting catalogue of candidate quasars/AGNs, consisting of ~ 3.6 million 
objects, while the dashed black line indicates the ~ 0.5 million objects flagged as robust candidates. 


to both new candidate quasars which were not 
spectroscopically confirmed in the SDSS (region between the two dashed 
lines) and to a contamination from low/medium luminosity 
AGNs (region between the two black lines) which, as 
discussed in Sec. 5.1, cannot be disentangled on the 
grounds of optical photometry only. 

The trends depicted in the diagram of Fig. 2 find a 
statistical confirmation in Fig. 3, where we show the results of 
the 1D Kolmogorov-Smirnov (K-S) test (Ledermann 1983), 
graphically represented through the cumulative psfMag_i 
band distribution functions. The test has been performed to 
compare the distributions of QSO/AGN candidate sources 
between the photometric catalogue and the spectroscopic KB, 
below and beyond the completeness limit of the 
SDSS spectroscopic sample (i = 19.1), and before and after 
the application of the Croom et al. (2009) cuts. Within the 
completeness limit, the K-S test confirms the similarity of 
the two distributions, which overlap perfectly after the 
application of the Croom et al. (2009) cuts. Beyond the 
completeness limit, the K-S test makes evident a difference 
between the two distributions, which is reasonable given the 
incompleteness of the spectroscopic sample. 
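
For reference, a two-sample K-S comparison of two magnitude distributions can be sketched with the Python standard library alone. The statistic is the largest distance between the two empirical cumulative distribution functions; the acceptance criterion uses the standard large-sample critical value, which is an assumption about how the 95% level was applied here. The function names are illustrative.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum is attained at one of the data points
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample critical value for samples of size n and m."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))


def same_distribution(sample_a, sample_b, alpha=0.05):
    """True if the test does not reject equality at significance alpha."""
    d = ks_two_sample(sample_a, sample_b)
    return d <= ks_critical_value(len(sample_a), len(sample_b), alpha)
```

In the present context the two samples would be the psfMag_i magnitudes of the photometric candidates and of the spectroscopic QSOs within a given magnitude range.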

In the produced catalogue, a quality flag has been 
included to take the Croom et al. (2009) cuts into 
consideration. 


6 CONCLUSIONS 

The main aim of the present work was to investigate the 
possibility of disentangling different spectral classes of 
objects, by exploiting the machine learning based model named 
MLPQNA (Brescia et al. 2012). To this end, three 
categories of experiments have been performed. 

In a first group of experiments we investigated the 
possibility of disentangling the different spectroscopic types 
present in the GALAXY class. All experiments, summarized in 
Tables 5 and 11, were unable to separate AGNs from 
normal galaxies on the grounds of optical data only, although 
the overall efficiency is always ~ 90%, thus confirming what 
we found in Cavuoti et al. (2014a). Excluding the AGN 
class, in all other cases where the number of representative 
objects in the KB is sufficiently large, the disentangling 
appears quite feasible, in particular when considering the 
results of the experiment NBA (Table 10). 

The second category concerned distinguishing between the 
QSO and STAR classes. The results, shown in Table 12, confirmed 
the findings of Abraham et al. (2012) that the use of 10 
colors, rather than the standard 4 colors, although not 
carrying a great amount of additional information, may help 
machine learning methods to find better solutions. 

The third category focused on identifying candidate 
QSOs from the whole catalogue, including also stars and 
galaxies, thus allowing us to release the candidate QSO 
photometric catalogue. In this case, avoiding 
the known downside of hierarchical pairwise classification 
(multiplicative propagation of misclassification) gives an 
intrinsic advantage to a direct classification of QSOs from 
the whole catalogue. The resulting catalogue of QSO 
candidates contains 3,602,210 objects, of which 529,923 are 
flagged as robust candidates, according to the quality flag 
introduced to take into account the Croom et al. (2009) 
selection criteria. This catalogue will be made publicly 
available through the CDS VizieR facility. As discussed in the 
text, the catalogue needs to be filtered according to the 
specific needs of the user. 

Figure 3. Results of the 1D Kolmogorov-Smirnov test comparing the psfMag_i band distributions of QSO/AGN candidates between 
the photometric catalogue (red line) and the spectroscopic KB (black line). The upper diagrams show the tests done for brighter 
sources (i < 19.1), respectively before (upper left) and after (upper right) the cuts of Croom et al. (2009). The lower diagrams show the 
same for fainter sources (i ≥ 19.1). The K-S test was positive in the two cases shown in the upper diagrams, 
confirming the similarity at the 95% significance level within the spectroscopic completeness limit. In particular, the two distributions 
appear to overlap perfectly in the upper right diagram. 